{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mitigating Disparities in Ranking from Binary Data\n",
    "_**An example based on the Law School Admissions Council's National Longitudinal Bar Passage Study**_\n",
    "\n",
    "\n",
    "## Contents\n",
    "\n",
    "1. [What is Covered](#What-is-Covered)\n",
    "1. [Overview](#Overview)\n",
    "1. [Data](#Data)\n",
    "1. [Unmitigated Predictor](#Unmitigated-Predictor)\n",
    "1. [Mitigating Demographic Disparity with Grid Search](#Mitigating-Demographic-Disparity-with-Grid-Search)\n",
    "   1. [Comparing Probabilistic Predictors](#Comparing-Probabilistic-Predictors)\n",
    "1. [Obtaining Low-Disparity Classifiers](#Obtaining-Low-Disparity-Classifiers)\n",
    "   1. [Postprocessing](#Postprocessing)\n",
    "   1. [Exponentiated Gradient](#Exponentiated-Gradient)\n",
    "   1. [Comparing Classifiers](#Comparing-Classifiers)\n",
    "\n",
    "\n",
    "## What is Covered\n",
    "\n",
    "* **Domain:**\n",
    "  * Education (law-school admissions). Please review the usage notes at the end of [Overview](#Overview).\n",
    "  \n",
    "* **ML tasks:**\n",
    "  * Prediction of the probability of success in a bar-passage exam based on binary classification data.\n",
    "  * Binary classification.\n",
    "  * Ranking based on probabilistic predictions.\n",
    "\n",
    "* **Fairness tasks:**\n",
    "  * Assessment of unfairness using Fairlearn metrics.\n",
    "  * Mitigation of unfairness using Fairlearn mitigation algorithms.\n",
    "\n",
    "* **Performance metrics:**\n",
    "  * Area under ROC curve.\n",
    "  * Worst-case area under ROC curve.\n",
    "  * Balanced accuracy.\n",
    "\n",
    "* **Fairness metrics:**\n",
    "  * Demographic parity difference (for both binary and continuous predictions).\n",
    "\n",
    "* **Mitigation algorithms:**\n",
    "  * `fairlearn.reductions.ExponentiatedGradient`\n",
    "  * `fairlearn.reductions.GridSearch`\n",
    "  * `fairlearn.postprocessing.ThresholdOptimizer`\n",
    "\n",
    "## Overview\n",
    "\n",
    "We consider the task of ranking students for admission to law school using the data collected in [Law School Admissions Council's (LSAC) National Longitudinal Bar Passage Study](https://eric.ed.gov/?id=ED469370); specifically, the version downloaded from [Project SEAPHE](http://www.seaphe.org/databases.php). We highlight some of the fairness considerations that come up not only in school admissions, but also in other ranking scenarios. Necessarily, our example is simplified and ignores many real-world considerations specific to school admissions.\n",
    "\n",
    "The data set contains information about law students collected by LSAC between 1991 and 1997. Some of the information is available at the admission time (such as the undergraduate GPA and LSAT score), and some describes the performance of the students once admitted. We also have access to their self-identified race. To simplify this example, we will limit the attention to those self-identified as **black** and **white** (two largest groups) and restrict our attention to two features (undergraduate GPA and LSAT score).\n",
    "\n",
    "To help with ranking law school applicants, we train a model that uses the information that is available about a student at the admission time to predict the probability that they will pass their bar exam. The predictions of our model are intended to be used (among other factors) by admission officers to select the applicants. After training the initial model, we examine differences in the predictions it induces across two the two groups. We then mitigate these differences using three Fairlearn algorithms: `GridSearch`, `ThresholdOptimizer` and `ExponentiatedGradient`.\n",
    "\n",
    "**Usage notes:** This notebook is intended as an example of Fairlearn functionality and not a fully realistic case study of an admission scenario. In real world, one should think carefully about whether it is appropriate to rank or score individuals. Also, additional features beyond the two considered here (GPA and LSAT scores) should be considered in practice, as recommended by the authors of the [LSAC study](https://eric.ed.gov/?id=ED469370). Finally, in real-world settings, it would be inappropriate to restrict attention to only two of the subgroups without evaluating the impacts on other individuals."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "\n",
    "We download the data from SEAPHE. We filter the set of students to black and white and split them into training and test subsets. The training and test data sets are loaded in three parts:\n",
    "\n",
    "* **X_train**, **X_test**: features describing the training and test data; we keep two features: `ugpa` (undegraduate GPA) and `lsat` (LSAT score)\n",
    "\n",
    "* **y_train**, **y_test**: labels of the training and test data; the labels are 0 or 1, indicating whether a student passed the bar exam by the 2nd attempt\n",
    "\n",
    "* **A_train**, **A_test**: self-identified race of each student (black or white)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import requests\n",
    "import tempfile\n",
    "import zipfile\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "def load_lawschool_data(target):\n",
    "    \"\"\" Downloads SEAPHE lawschool data from the SEAPHE webpage.\n",
    "    For more information refer to http://www.seaphe.org/databases.php\n",
    "\n",
    "    :param target: the name of the target variable, either pass_bar or zfygpa\n",
    "    :type target: str\n",
    "    :return: pandas.DataFrame with columns\n",
    "    \"\"\"\n",
    "    if target not in ['pass_bar', 'zfygpa']:\n",
    "        raise ValueError(\"Only pass_bar and zfygpa are supported targets.\")\n",
    "\n",
    "    with tempfile.TemporaryDirectory() as temp_dir:\n",
    "        response = requests.get(\"http://www.seaphe.org/databases/LSAC/LSAC_SAS.zip\")\n",
    "        temp_file_name = os.path.join(temp_dir, \"LSAC_SAS.zip\")\n",
    "        with open(temp_file_name, \"wb\") as temp_file:\n",
    "            temp_file.write(response.content)\n",
    "        with zipfile.ZipFile(temp_file_name, 'r') as zip_ref:\n",
    "            zip_ref.extractall(temp_dir)\n",
    "        data = pd.read_sas(os.path.join(temp_dir, \"lsac.sas7bdat\"))\n",
    "\n",
    "        # data contains 'sex', 'gender', and 'male' which are all identical except for the type;\n",
    "        # map string representation of feature \"sex\" to 0 for Female and 1 for Male\n",
    "        data = data.assign(gender=(data[\"gender\"] == b\"male\") * 1)\n",
    "\n",
    "        # filter out all records except the ones with the most common two races\n",
    "        data = data.assign(white=(data[\"race1\"] == b\"white\") * 1)\n",
    "        data = data.assign(black=(data[\"race1\"] == b\"black\") * 1)\n",
    "        data = data[(data['white'] == 1) | (data['black'] == 1)]\n",
    "\n",
    "        # encode dropout as 0/1\n",
    "        data = data.assign(dropout=(data[\"Dropout\"] == b\"YES\") * 1)\n",
    "\n",
    "        if target == 'pass_bar':\n",
    "            # drop NaN records for pass_bar\n",
    "            data = data[(data['pass_bar'] == 1) | (data['pass_bar'] == 0)]\n",
    "        elif target == 'zfygpa':\n",
    "            # drop NaN records for zfygpa\n",
    "            data = data[np.isfinite(data['zfygpa'])]\n",
    "\n",
    "        # drop NaN records for features\n",
    "        data = data[np.isfinite(data[\"lsat\"]) & np.isfinite(data['ugpa'])]\n",
    "\n",
    "        # Select relevant columns for machine learning.\n",
    "        # We explicitly leave in age_cat to allow linear classifiers to be non-linear in age\n",
    "        # TODO: consider using 'fam_inc', 'age', 'parttime', 'dropout'\n",
    "        data = data[['white', 'black', 'gender', 'lsat', 'ugpa', target]]\n",
    "\n",
    "    return data\n",
    "\n",
    "# Extract into X, y and A\n",
    "data = load_lawschool_data('pass_bar')\n",
    "X = data[['lsat', 'ugpa']]\n",
    "y = data['pass_bar']\n",
    "A = data['white'].apply(lambda x: 'white' if x==1 else 'black').rename('race')\n",
    "\n",
    "# Split into test and train, making sure we have sequential indices in the results\n",
    "X_train, X_test, y_train, y_test, A_train, A_test = \\\n",
    "    train_test_split(X, y, A, test_size=0.33, random_state=123)\n",
    "\n",
    "X_train = X_train.reset_index(drop=True)\n",
    "y_train = y_train.reset_index(drop=True)\n",
    "A_train = A_train.reset_index(drop=True)\n",
    "X_test = X_test.reset_index(drop=True)\n",
    "y_test = y_test.reset_index(drop=True)\n",
    "A_test = A_test.reset_index(drop=True)\n",
    "\n",
    "# Combine all training data into a single data frame and glance at a few rows\n",
    "all_train = pd.concat([X_train, y_train, A_train], axis=1)\n",
    "display(all_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let us examine the data more closely. We look at the distributions of `lsat` and `ugpa` by race (summarized via quartiles), and compare them with the bar passage rates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_train_grouped = all_train.groupby('race')\n",
    "\n",
    "counts_by_race = all_train_grouped[['lsat']].count().rename(\n",
    "    columns={'lsat': 'count'})\n",
    "\n",
    "quartiles_by_race = all_train_grouped[['lsat','ugpa']].quantile([.25, .50, .75]).rename(\n",
    "    index={0.25: \"25%\", 0.5: \"50%\", 0.75: \"75%\"}, level=1).unstack()\n",
    "\n",
    "rates_by_race = all_train_grouped[['pass_bar']].mean().rename(\n",
    "    columns={'pass_bar': 'pass_bar_rate'})\n",
    "\n",
    "summary_by_race = pd.concat([counts_by_race, quartiles_by_race, rates_by_race], axis=1)\n",
    "display(summary_by_race)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The majority of the students in the study are white. There is a notable gap between white and black students in their incoming academic credentials: the 75th percentile of the LSAT scores of black students is lower than the 25th percentile of the LSAT scores among white students. There is a less severe, but still substantial gap in UGPA. The achievement gap is greatly diminished in terms of the bar passage rate (78% for black students and 97% for white students). The authors of the [LSAC study](https://eric.ed.gov/?id=ED469370) conclude that this justifies admission practices that look beyond LSAT and UGPA. However, in this simplified example, we build predictors of bar passage from these two variables alone."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unmitigated Predictor\n",
    "\n",
    "We first train a standard logistic regression predictor that does not seek to incorporate any notion of fairness."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "unmitigated_predictor = LogisticRegression(solver='liblinear', fit_intercept=True)\n",
    "unmitigated_predictor.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We view the probabilistic predictions produced by the logistic model as scores and evaluate the quality of the ranking they produce in terms of the [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) (AUC). AUC is equal to the probability that a randomly chosen positive example (i.e., a student who passes the bar) is scored above a randomly chosen negative example (i.e., a student who does not pass the bar). An AUC of 0.5 means that the scores are no better than a random coin flip, whereas AUC of 1.0 means that the scores perfectly separate positives from negatives. The AUC metric has two desirable properties: (1) it is preserved by monotone transformations of the score, and (2) it is not sensitive to the imbalance between positives and negatives, which is quite severe in our example, with the overall bar passage rate above 94%.\n",
    "\n",
    "Note that the logistic regression estimator above does not seek to optimize AUC directly, but only seeks to optimize the logistic loss. However, a good logistic loss is also expected to yield a good AUC.\n",
    "\n",
    "To obtain the AUC values for the overall student population as well as black and white subpopulations, we use the **group metric** variant of the `sklearn` metric [`roc_auc_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sklearn.metrics as skm\n",
    "from fairlearn.metrics import MetricFrame, selection_rate\n",
    "\n",
    "# a convenience function that transforms the result of a group metric call into a data frame\n",
    "def summary_as_df(name, summary):\n",
    "    a = summary.by_group\n",
    "    a['overall'] = summary.overall\n",
    "    return pd.DataFrame({name: a})\n",
    "\n",
    "scores_unmitigated = pd.Series(unmitigated_predictor.predict_proba(X_test)[:,1], name=\"score_unmitigated\")\n",
    "metric_frame_unmitigated = MetricFrame(skm.roc_auc_score,\n",
    "                                       y_test, scores_unmitigated, sensitive_features=A_test)\n",
    "auc_unmitigated = summary_as_df(\"auc_unmitigated\", metric_frame_unmitigated)\n",
    "\n",
    "display(HTML('<span id=\"auc_unmitigated\">'),\n",
    "        auc_unmitigated,\n",
    "        HTML('</span>'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We next examine how the unmitigated predictor affects applicants of different race when it is used to score them. We plot the CDFs of the scores it generates for each group. We then consider all possible thresholds on the value of the score, and for each threshold check the fraction of black vs white students above the threshold. The largest observed difference across all possible thresholds is referred to as the **demographic parity difference** or **demographic disparity** (see [Agarwal et al. 2018](http://proceedings.mlr.press/v97/agarwal19d.html), where it is reffered to as SP disparity). Pictorially, this corresponds to the largest vertical difference between the two CDFs. Note that this disparity metric is preserved under monotone transformations of the scores."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.stats import cumfreq\n",
    "\n",
    "def compare_cdfs(data, A, num_bins=100):\n",
    "    cdfs = {}\n",
    "    assert len(np.unique(A)) == 2\n",
    "    \n",
    "    limits = ( min(data), max(data) )\n",
    "    s = 0.5 * (limits[1] - limits[0]) / (num_bins - 1)\n",
    "    limits = ( limits[0]-s, limits[1] + s)\n",
    "    \n",
    "    for a in np.unique(A):\n",
    "        subset = data[A==a]\n",
    "        \n",
    "        cdfs[a] = cumfreq(subset, numbins=num_bins, defaultreallimits=limits)\n",
    "        \n",
    "    lower_limits = [v.lowerlimit for _, v in cdfs.items()]\n",
    "    bin_sizes = [v.binsize for _,v in cdfs.items()]\n",
    "    actual_num_bins = [v.cumcount.size for _,v in cdfs.items()]\n",
    "    \n",
    "    assert len(np.unique(lower_limits)) == 1\n",
    "    assert len(np.unique(bin_sizes)) == 1\n",
    "    assert np.all([num_bins==v.cumcount.size for _,v in cdfs.items()])\n",
    "    \n",
    "    xs = lower_limits[0] + np.linspace(0, bin_sizes[0]*num_bins, num_bins)\n",
    "    \n",
    "    disparities = np.zeros(num_bins)\n",
    "    for i in range(num_bins):\n",
    "        cdf_values = np.clip([v.cumcount[i]/len(data[A==k]) for k,v in cdfs.items()],0,1)\n",
    "        disparities[i] = max(cdf_values)-min(cdf_values)  \n",
    "    \n",
    "    return xs, cdfs, disparities\n",
    "    \n",
    "    \n",
    "def plot_and_compare_cdfs(data, A, num_bins=100, loc='best'):\n",
    "    xs, cdfs, disparities = compare_cdfs(data, A, num_bins)\n",
    "    \n",
    "    for k, v in cdfs.items():\n",
    "        plt.plot(xs, v.cumcount/len(data[A==k]), label=k)\n",
    "    \n",
    "    assert disparities.argmax().size == 1\n",
    "    d_idx = disparities.argmax()\n",
    "    \n",
    "    xs_line = [xs[d_idx],xs[d_idx]]\n",
    "    counts = [v.cumcount[d_idx]/len(data[A==k]) for k, v in cdfs.items()]\n",
    "    ys_line = [min(counts), max(counts)]\n",
    "    \n",
    "    plt.plot(xs_line, ys_line, 'o--')\n",
    "    disparity_label = \"max disparity = {0:.3f}\\nat {1:0.3f}\".format(disparities[d_idx], xs[d_idx])\n",
    "    plt.text(xs[d_idx], 1, disparity_label, ha=\"right\", va=\"top\")\n",
    "    \n",
    "    plt.xlabel(data.name)\n",
    "    plt.ylabel(\"cumulative frequency\")\n",
    "    plt.legend(loc=loc)\n",
    "    plt.show()\n",
    "\n",
    "display(HTML('<span id=\"disparity_unmitigated\">'))\n",
    "plot_and_compare_cdfs(scores_unmitigated, A_test)\n",
    "display(HTML('</span>'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the largest disparity of about 0.6 occurs at the threshold value 0.94: only 23% of black students, but 83% of white students are above this threshold."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mitigating Demographic Disparity with Grid Search\n",
    "\n",
    "We next show how to mitigate the demographic disparity using the `GridSearch` algorithm of Fairlearn. We will use this algorithm to obtain several models that achieve various trade-offs between accuracy (measured by AUC) and demographic disparity.\n",
    "\n",
    "The `GridSearch` variant that we will use was developed for *classification* under demographic parity, but the experiments of\n",
    "[Agarwal et al. (2018)](http://proceedings.mlr.press/v97/agarwal19d.html) show that it also performs well for *logistic regression* (viewed as the probability prediction) under demographic parity. While the resulting logistic models mitigate the demographic disparity, they might not be well calibrated (unlike unmitigated logistic models), so we use Platt's scaling for [calibration](https://scikit-learn.org/stable/modules/calibration.html). Note that Platt's scaling is a monotone transformation, and so it has no effect on the AUC values or the demographic disparity of the resulting model. However, it makes the predicted scores interpretable as probabilities.\n",
    "\n",
    "`GridSearch` generates models corresponding to various Lagrange multiplier vectors of the underlying constraint optimization problem. We will compute 41 models on a grid of Lagrange multiplier vectors whose L1-norm is bounded by 10. For details on how the search works, refer to Section 3.4 of [Agarwal et. al (2018)](http://proceedings.mlr.press/v80/agarwal18a.html). The following cell may take a couple of minutes to run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fairlearn.reductions import GridSearch, DemographicParity\n",
    "from sklearn.calibration import CalibratedClassifierCV\n",
    "\n",
    "sweep = GridSearch(LogisticRegression(solver='liblinear', fit_intercept=True),\n",
    "                   constraints=DemographicParity(),\n",
    "                   grid_size=41,\n",
    "                   grid_limit=10)\n",
    "\n",
    "sweep.fit(X_train, y_train, sensitive_features=A_train)\n",
    "\n",
    "calibrated_predictors = []\n",
    "for predictor in sweep.predictors_:\n",
    "    calibrated = CalibratedClassifierCV(base_estimator=predictor, cv='prefit', method='sigmoid')\n",
    "    calibrated.fit(X_train, y_train)\n",
    "    calibrated_predictors.append(calibrated)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We next assess the accuracy and disparity of the obtained predictors in a scatter plot, with *x* axis showing the worst-case AUC among the two subpopulations (of black and white students) and *y* axis showing the demographic disparity. Ideal models would be in the bottom right."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def roc_auc_score_group_min(y_true, y_pred, sensitive_features, sample_weight=None):\n",
    "    mf = MetricFrame(skm.roc_auc_score,\n",
    "                     y_true, y_pred,\n",
    "                     sensitive_features=sensitive_features,\n",
    "                     sample_params = {'sample_weight': sample_weight})\n",
    "    return mf.group_min()\n",
    "\n",
    "def auc_disparity_sweep_plot(predictors, names, marker='o', scale_size=1, zorder=-1):\n",
    "    roc_auc = np.zeros(len(predictors))\n",
    "    disparity = np.zeros(len(predictors))\n",
    "    \n",
    "    for i in range(len(predictors)):\n",
    "        preds = predictors[i].predict_proba(X_test)[:,1]\n",
    "        roc_auc[i] = roc_auc_score_group_min(y_test, preds, sensitive_features=A_test)\n",
    "        _, _, dis = compare_cdfs(preds, A_test)\n",
    "        disparity[i] = dis.max()\n",
    "        \n",
    "    plt.scatter(roc_auc, disparity,\n",
    "                s=scale_size * plt.rcParams['lines.markersize'] ** 2, marker=marker, zorder=zorder)\n",
    "    for i in range(len(roc_auc)):\n",
    "        plt.annotate(names[i], (roc_auc[i], disparity[i]), xytext=(3,2), textcoords=\"offset points\", zorder=zorder+1)\n",
    "    plt.xlabel(\"worst-case AUC\")\n",
    "    plt.ylabel(\"demographic disparity\")\n",
    "    \n",
    "auc_disparity_sweep_plot(calibrated_predictors, names=range(len(calibrated_predictors)))\n",
    "auc_disparity_sweep_plot([unmitigated_predictor], names=[''], marker='*', zorder=1, scale_size=5)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Model 33 has the lowest disparity, but its worst-case AUC is essentially the same as that of a coin flip. The unmitigated model, marked as a star, has a good worst-case AUC, but large disparity. We examine models 35 and 36: their AUC values are well above 0.6 and they substantially reduce the demographic disparity compared with the unmitigated model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "scores_model35 = pd.Series(calibrated_predictors[35].predict_proba(X_test)[:,1], name=\"score_model35\")\n",
    "scores_model36 = pd.Series(calibrated_predictors[36].predict_proba(X_test)[:,1], name=\"score_model36\")\n",
    "\n",
    "metric_frame_model35 = MetricFrame(metrics=skm.roc_auc_score, y_true=y_test, y_pred=scores_model35, sensitive_features=A_test)\n",
    "metric_frame_model36 = MetricFrame(metrics=skm.roc_auc_score, y_true=y_test, y_pred=scores_model36, sensitive_features=A_test)\n",
    "\n",
    "auc_model35 = summary_as_df(\"auc_model35\", metric_frame_model35)\n",
    "auc_model36 = summary_as_df(\"auc_model36\", metric_frame_model36)\n",
    "\n",
    "display(HTML('<span id=\"grid_search_comparison\">'),\n",
    "        pd.concat([auc_model35, auc_model36, auc_unmitigated], axis=1),\n",
    "        HTML('</span>'))\n",
    "plot_and_compare_cdfs(scores_model35, A_test)\n",
    "plot_and_compare_cdfs(scores_model36, A_test)\n",
    "plot_and_compare_cdfs(scores_unmitigated, A_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Obtaining Low-Disparity Classifiers\n",
    "\n",
    "In this section, we shift attention from the task of scoring and ranking students to the task of automatically classifying students, for example, in order to screen them for an interview or a deeper review of their application materials. Our goal is to obtain a _classifier_ that maximizes AUC while respecting demographic parity.\n",
    "\n",
    "The outputs of a classifier are either 0 or 1, so it is possible to re-interpret the AUC of a classifier as the *balanced accuracy*, meaning the accuracy under the distribution re-weighted to have the same mass of positive and negative examples. Demographic disparity can also be interpreted as the difference between the rates at which the students of either race are classified as 1; we refer to this rate as the _selection rate_.\n",
    "\n",
    "### Postprocessing\n",
    "\n",
    "We first show how to obtain low-disparity classifiers by thresholding scores&mdash;such as the scores produced by unmitigated logistic regression&mdash;using the postprocessing algorithm of [Hardt et al. (2016)](https://arxiv.org/abs/1610.02413), implemented in the class  `ThresholdOptimizer`. This algorithm finds thresholds that optimize accuracy subject to the constraint that there be no demographic disparity on the training data. Since our goal here is to optimize _balanced_ accuracy rather than accuracy, we first re-balance the data by randomly subsampling positive examples, so they are equal in number to negative examples. We then pass this re-balanced data set to `ThresholdOptimizer`. Since the accuracy of a classifier on the re-balanced data set is in expectation equal to the AUC on the original data, `ThresholdOptimizer` now seeks to optimize our desired accuracy metric."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.base import BaseEstimator, ClassifierMixin\n",
    "from fairlearn.postprocessing import ThresholdOptimizer\n",
    "\n",
    "# We want to apply ThresholdOptimizer to the probabilities returned\n",
    "# by the unmitigated logistic regression predictor. Since ThresholdOptimizer\n",
    "# applies thresholding to the output of predict(), but LogisticRegression\n",
    "# returns probabilities (of both classes) in predict_proba(), we need to\n",
    "# use the following wrapper for LogisticRegression.\n",
    "\n",
    "class LogisticRegressionAsRegression(BaseEstimator, ClassifierMixin):\n",
    "    def __init__(self, logistic_regression_estimator):\n",
    "        self.logistic_regression_estimator = logistic_regression_estimator\n",
    "    \n",
    "    def fit(self, X, y):\n",
    "        self.logistic_regression_estimator.fit(X, y)\n",
    "        return self\n",
    "    \n",
    "    def predict(self, X):\n",
    "        # use predict_proba to get real values instead of 0/1, select only prob for 1\n",
    "        scores = self.logistic_regression_estimator.predict_proba(X)[:,1]\n",
    "        return scores\n",
    "\n",
    "\n",
    "balanced_index_pass0 = y_train[y_train==0].index \n",
    "balanced_index_pass1 = y_train[y_train==1].sample(n=balanced_index_pass0.size, random_state=0).index\n",
    "balanced_index = balanced_index_pass0.union(balanced_index_pass1)\n",
    "\n",
    "pp_estimator = ThresholdOptimizer(\n",
    "    estimator=LogisticRegressionAsRegression(unmitigated_predictor),\n",
    "    constraints=\"demographic_parity\",\n",
    "    prefit=True)\n",
    "\n",
    "pp_estimator.fit(X_train.iloc[balanced_index,:], y_train.iloc[balanced_index],\n",
    "                 sensitive_features=A_train.iloc[balanced_index])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We next evaluate AUC (balanced accuracy) and demographic disparity (disparity in selection rates) of the black and white students on the test data; note that we use the actual test data (not a re-balanced version, which we only used for training purposes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fairlearn.metrics import mean_prediction\n",
    "\n",
    "scores_pp = pd.Series(pp_estimator.predict(X_test, sensitive_features=A_test), name=\"scores_post\")\n",
    "metric_frame_auc_pp = MetricFrame(metrics=skm.roc_auc_score, y_true=y_test, y_pred=scores_pp, sensitive_features=A_test)\n",
    "metric_frame_sel_pp = MetricFrame(metrics=mean_prediction, y_true=y_test, y_pred=scores_pp, sensitive_features=A_test)\n",
    "auc_pp = summary_as_df(\"auc_post\", metric_frame_auc_pp)\n",
    "sel_pp = summary_as_df(\"selection_post\", metric_frame_sel_pp)\n",
    "\n",
    "pp_summary = pd.concat([auc_pp, sel_pp], axis=1)\n",
    "pp_summary.loc['disparity']=(pp_summary.loc['white']-pp_summary.loc['black']).abs()\n",
    "pp_summary.loc['disparity', pp_summary.columns.str.startswith('auc')]='-'\n",
    "\n",
    "display(pp_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The original unmitigated scores have the demographic disparity around 0.6 (see [here](#disparity_unmitigated)). We see that `ThresholdOptimizer` dramatically reduces the disparity to around 0.1. At the same time, the AUC in each subpopulation is at or above 0.65, a moderate drop from the unmitigated values of 0.72 and 0.74 (see [here](#auc_unmitigated)). This is a more favorable trade-off than the one achieved by model 35 above, with the disparity of 0.4 and the worst-case AUC of around 0.62 (see [here](#grid_search_comparison)). However, note that `ThresholdOptimizer` is a classifier, and so it can only work as a crude ranker. Additionally, `ThresholdOptimizer` uses the sensitive feature (in this instance race) at the prediction time, by applying a different threshold to unmitigated scores depending on race. In some use cases, these two properties might be undesirable. We next show how to obtain a classifier that also seeks to achieve low demographic disparity, but without requiring access to the sensitive feature at the evaluation time.\n",
    "\n",
    "*Note*: `ThresholdOptimizer` produces randomized predictions, so the AUC and selection rate of postprocessing will vary if you re-run the cell above. Also, while `ThresholdOptimizer` is guaranteed to achieve zero demographic disparity on its training data, this does not mean it will achieve zero demographic disparity on the test data for several reasons: (1) the training data is balanced whereas test data is not, so test data comes from a different distribution than training data; (2) even if training and test data were coming from the same distribution, there would be some differences due to finite sample sizes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exponentiated Gradient\n",
    "\n",
    "`ExponentiatedGradient` also seeks to find a classifier that optimizes accuracy while placing a constraint on the demographic disparity. However, it operates as a *reduction* to standard classification, taking any estimator as a black box. During its run it repeatedly re-fits the estimator on variously reweighted training data and eventually produces a randomized classifier of the same type as the provided black-box estimator. This means that if the black box does not have access to the sensitive feature, neither will the predictor fitted by `ExponentiatedGradient`.\n",
    "\n",
    "We next train two classifiers via `ExponentiatedGradient`. Both use `LogisticRegression` as a black box. However, one has only access to the original features (**X_train** and **X_test**), whereas the other one also has access to the sensitive features, which we include in the extended feature set (**XA_train** and **XA_test**). Both classifiers optimize AUC subject to the constraint that demographic disparity on training data is at most 0.01. We also set the convergence parameter `nu` to `1e-6` to optimize to numerical precision (the default is to optimize to statistical precision, which we override here)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fairlearn.reductions import ExponentiatedGradient\n",
    "\n",
    "XA_train = pd.concat([X_train, A_train=='black'], axis=1).astype(float)\n",
    "XA_test = pd.concat([X_test, A_test=='black'], axis=1).astype(float)\n",
    "\n",
    "expgrad_X = ExponentiatedGradient(\n",
    "    LogisticRegression(solver='liblinear', fit_intercept=True),\n",
    "    constraints=DemographicParity(difference_bound=0.01),\n",
    "    eps=0.01,\n",
    "    nu=1e-6)\n",
    "expgrad_XA = ExponentiatedGradient(\n",
    "    LogisticRegression(solver='liblinear', fit_intercept=True),\n",
    "    constraints=DemographicParity(difference_bound=0.01),\n",
    "    eps=0.01,\n",
    "    nu=1e-6)\n",
    "\n",
    "expgrad_X.fit(\n",
    "    X_train.iloc[balanced_index,:],\n",
    "    y_train.iloc[balanced_index],\n",
    "    sensitive_features=A_train.iloc[balanced_index])\n",
    "expgrad_XA.fit(\n",
    "    XA_train.iloc[balanced_index,:],\n",
    "    y_train.iloc[balanced_index],\n",
    "    sensitive_features=A_train.iloc[balanced_index])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores_expgrad_X = pd.Series(expgrad_X.predict(X_test), name=\"scores_expgrad_X\")\n",
    "scores_expgrad_XA = pd.Series(expgrad_XA.predict(XA_test), name=\"scores_expgrad_XA\")\n",
    "\n",
    "metric_frame_auc_expgrad_X = MetricFrame(metrics=skm.roc_auc_score, y_true=y_test, y_pred=scores_expgrad_X, sensitive_features=A_test)\n",
    "metric_frame_mean_expgrad_X = MetricFrame(metrics=mean_prediction, y_true=y_test, y_pred=scores_expgrad_X, sensitive_features=A_test)\n",
    "metric_frame_auc_expgrad_XA = MetricFrame(metrics=skm.roc_auc_score, y_true=y_test, y_pred=scores_expgrad_XA, sensitive_features=A_test)\n",
    "metric_frame_mean_expgrad_XA = MetricFrame(metrics=mean_prediction, y_true=y_test, y_pred=scores_expgrad_XA, sensitive_features=A_test)\n",
    "auc_expgrad_X = summary_as_df(\"auc_expgrad_X\", metric_frame_auc_expgrad_X)\n",
    "sel_expgrad_X = summary_as_df(\"selection_expgrad_X\", metric_frame_mean_expgrad_X)\n",
    "auc_expgrad_XA = summary_as_df(\"auc_expgrad_XA\", metric_frame_auc_expgrad_XA)\n",
    "sel_expgrad_XA = summary_as_df(\"selection_expgrad_XA\", metric_frame_mean_expgrad_XA)\n",
    "\n",
    "classifier_summary = pd.concat([auc_pp, sel_pp, auc_expgrad_X, sel_expgrad_X, auc_expgrad_XA, sel_expgrad_XA], axis=1)\n",
    "classifier_summary.loc['disparity']=(classifier_summary.loc['white']-classifier_summary.loc['black']).abs()\n",
    "classifier_summary.loc['disparity', classifier_summary.columns.str.startswith('auc')]='-'\n",
    "display(classifier_summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that exponentiated gradient variants generally achieve lower disparity on this data than `ThresholdOptimizer`. Without access to the sensitive feature at the test time, this comes at the cost of bringing the AUC essentially to that of a random coin toss (AUC of **expgrad_X** is close to 0.5). With access to the sensitive feature, the overall AUC is comparable to that achieved by `ThresholdOptimizer`, but `ThresholdOptimizer` achieves a better worst-case AUC across the two population. \n",
    "\n",
    "*Note*: `ExponentiatedGradient` produces randomized predictions (similarly to `ThresholdOptimizer`), so the AUC and selection rate will vary if you re-run the cell above. Also, because of a mismatch between the training and test distributions and because of finite samples, we do not expect `ExponentiatedGradient` to achieve test disparity equal to 0.01.\n",
    "\n",
    "We next show that if we are willing to tolerate a larger demographic disparity, it is possible to achieve non-trivial AUC values even without access to the sensitive feature. We run `ExponentiatedGradient` with the bound on the training disparity equal to 0.3:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "expgrad_X_alt = ExponentiatedGradient(\n",
    "    LogisticRegression(solver='liblinear', fit_intercept=True),\n",
    "    constraints=DemographicParity(difference_bound=0.3),\n",
    "    eps=0.3, # This has changed from 0.01 in the above examples\n",
    "    nu=1e-6)\n",
    "\n",
    "expgrad_X_alt.fit(\n",
    "    X_train.iloc[balanced_index,:],\n",
    "    y_train.iloc[balanced_index],\n",
    "    sensitive_features=A_train.iloc[balanced_index])\n",
    "\n",
    "scores_expgrad_X_alt = pd.Series(\n",
    "    expgrad_X_alt.predict(X_test), name=\"scores_expgrad_X_alt\")\n",
    "\n",
    "metric_frame_auc_expgrad_X_alt = MetricFrame(metrics=skm.roc_auc_score, y_true=y_test, y_pred=scores_expgrad_X_alt, sensitive_features=A_test)\n",
    "metric_frame_mean_expgrad_X_alt = MetricFrame(metrics=mean_prediction, y_true=y_test, y_pred=scores_expgrad_X_alt, sensitive_features=A_test)\n",
    "auc_expgrad_X_alt = summary_as_df(\"auc_expgrad_X_alt\", metric_frame_auc_expgrad_X_alt)\n",
    "sel_expgrad_X_alt = summary_as_df(\"selection_expgrad_X_alt\", metric_frame_mean_expgrad_X_alt)\n",
    "\n",
    "auc_expgrad_X_alt.loc['disparity'] = '-'\n",
    "sel_expgrad_X_alt.loc['disparity'] = (sel_expgrad_X_alt.loc['white'] - sel_expgrad_X_alt.loc['black']).abs()\n",
    "\n",
    "display(pd.concat([auc_expgrad_X_alt, sel_expgrad_X_alt], axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Comparing Classifiers\n",
    "\n",
    "We finish this section by comparing the four predictors above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "metric_frames_auc = {\n",
    "    \"postprocessing\": metric_frame_auc_pp,\n",
    "    \"expgrad_X_alt\": metric_frame_auc_expgrad_X_alt,\n",
    "    \"expgrad_X\": metric_frame_auc_expgrad_X,\n",
    "    \"expgrad_XA\": metric_frame_auc_expgrad_XA\n",
    "}\n",
    "metric_frames_sel = {\n",
    "    \"postprocessing\": metric_frame_sel_pp,\n",
    "    \"expgrad_X_alt\": metric_frame_mean_expgrad_X_alt,\n",
    "    \"expgrad_X\": metric_frame_mean_expgrad_X,\n",
    "    \"expgrad_XA\": metric_frame_mean_expgrad_XA\n",
    "}\n",
    "x = [metric_frame.group_min() for metric_frame in metric_frames_auc.values()]\n",
    "y = [metric_frame.difference() for metric_frame in metric_frames_sel.values()]\n",
    "keys = list(metric_frames_auc.keys())\n",
    "plt.scatter(x, y)\n",
    "for i in range(len(x)):\n",
    "    plt.annotate(keys[i], (x[i] + 0.003, y[i]))\n",
    "plt.xlabel(\"worst-case AUC\")\n",
    "plt.ylabel(\"selection rate difference\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}